InterPreTS is a tool designed to predict interaction based on the tertiary structure of proteins. To use it, we require FASTA files for each of the proteins we're interested in using. The greedy approach would be to attempt to get data for all the known human proteins from the gene database. These were already found when working on the Gene Ontology features.
Again, using the NCBI gene database and Biopython it's possible to find FASTA formatted records for each of these proteins. Then the tool can be downloaded from the InterPreTS website and predictions found that can be used as features.
Loading in the list of known human proteins:
In [1]:
cd ../../geneconversion/
It turns out that other people have tried to do this before and it can be done using the gene2refseq
conversion table and the batch Entrez tool here.
Instead of using the script defined there using Python to do the mapping from Entrez Gene ID to protein accession number:
In [3]:
import csv
In [4]:
#open file and filter line by tax id, then match to the human id to the protein id
f = open("gene2refseq")
c = csv.reader(f, delimiter="\t")
geneidtoprotein = {}
for line in c:
# make sure they're human proteins
if line[0] == "9606":
try:
geneidtoprotein[line[1]] += [line[6]]
except KeyError:
geneidtoprotein[line[1]] = [line[6]]
f.close()
In [5]:
counts = []
for k in geneidtoprotein.keys():
counts.append(len(geneidtoprotein[k]))
In [6]:
h=hist(counts, bins=arange(0,50)+0.5, range=(0,50))
In [12]:
proteinids = list(flatten(geneidtoprotein.values()))
print "Before removing duplicates %i protein gi IDs"%len(proteinids)
proteinids = set(proteinids)
print "After removing duplicates %i protein gi IDs"%len(proteinids)
In [13]:
f = open("human.proteingi.txt","w")
for p in proteinids:
f.write("{0}\n".format(p))
f.close()
Then this file is submitted to the [batch Entrez service, specifying Protein as the database. Tried this, but 72927 is too large for the service. It recommends downloading all proteins from the NCBI ftp service. Looking around it appears what we want is the file here.
It's described in the README as:
The RNA and protein directories provide sequence files in three formats representing all of the mRNA, non-coding transcript, and protein model reference sequences (RefSeq) exported as part of the genome annotation process.
Downloaded this and gunzipped it. Now have to parse this FASTA file with Biopython and match it to the protein database accession number:
In [17]:
from Bio import Entrez,SeqIO
Entrez.email = "gavingray1729@gmail.com"
In [50]:
proteinrecords = list(SeqIO.parse("protein.fa","fasta"))
In [40]:
print proteinrecords[0]
In [52]:
proteintorecord = {}
for r in proteinrecords:
#get the accession number
accnum = r.id.split("|")[1]
#use that to index this record
proteintorecord[accnum] = r
In [56]:
genetorecord = {}
for k in geneidtoprotein.keys():
for p in geneidtoprotein[k]:
try:
genetorecord[k] += [proteintorecord[p]]
except KeyError:
try:
genetorecord[k] = [proteintorecord[p]]
except KeyError:
#ignore invalid keys
pass
In [59]:
# add Entrez annotations to FASTA files
for k in genetorecord.keys():
for r in genetorecord[k]:
r.description += " Gene ID: {0}".format(k)
In [62]:
# write new FASTA file
SeqIO.write(flatten(genetorecord.values()),"human.protein.fasta","fasta")
Out[62]:
Each Gene ID pair will map to multiple FASTA pairs and each of these will have to be evaluated by the InterPreTS tool. These different isoforms are likely to interact differently, as shown in this paper. Trying an example pair:
In [67]:
pair = (genetorecord.keys()[0],genetorecord.keys()[100])
In [68]:
print pair
Writing a record from each of these identifiers to file:
In [76]:
SeqIO.write([genetorecord[pair[0]][0], genetorecord[pair[1]][0]],"example.pair.fasta","fasta")
Out[76]:
Web tool did not return anything, no hits. Trying with more proteins.
In [73]:
examplelist = []
for k in genetorecord.keys()[:20]:
examplelist.append(genetorecord[k][0])
In [75]:
SeqIO.write(examplelist,"example.group.fasta","fasta")
Out[75]:
This produces some results which will be viewable for the next six weeks here.
In [77]:
import scipy.misc
In [78]:
scipy.misc.comb(len(genetorecord.keys()),2)
Out[78]:
In [79]:
import pickle
In [80]:
# saving the table
f = open("gene.to.protein.human.txt","w")
for k in geneidtoprotein:
for p in geneidtoprotein[k]:
f.write("{0}\t{1}".format(k,p))
f.close()
In [81]:
#pickling the dictionaries
f = open("gene.to.protein.pickle","wb")
pickle.dump(geneidtoprotein,f)
f.close()
f = open("gene.to.record","wb")
pickle.dump(genetorecord,f)
f.close()